The following case study is intended to test how you would structure a solution for demand forecasting.
You have been given a dataset for retailer X, which has stores across the country. Please suggest an approach, or design a mix of approaches, for forecasting sales at the category-store-week level.
Data available:
1. Macro_Economic_Data: Outlet, YEARWK, Temp, Gas_price_in_dollars, Consumer_price_Index, UnEmployment_rate
2. Outlets: Outlet, Outlet_Type, Floor_Size
3. Holiday_Calendar: YEARWK, Is_A_Holiday_Flag
4. Heirarchy: Dept, Category
5. Train data: Outlet, Category, YEARWK, Weekly_Sales
6. Test data: Store, Category, YEARWK
There may be different types of time series: stable, sporadic, series with fewer weeks of history, etc. Please walk us through your approach to solving the exercise for both short-term forecasting (<= 6 weeks) and long-term forecasting (> 6 weeks).
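The distinction between stable, sporadic, and short-history series can be made operational before any model is chosen. Below is a minimal sketch of one way to segment a single outlet-category series; the thresholds (`min_history`, `zero_share_cut`, `cov_cut`) are illustrative assumptions, not values prescribed by the case study.

```python
import numpy as np
import pandas as pd

def classify_series(sales: pd.Series,
                    min_history: int = 52,
                    zero_share_cut: float = 0.3,
                    cov_cut: float = 0.5) -> str:
    """Rough segmentation of one outlet-category weekly sales series.

    All thresholds are illustrative assumptions:
    - fewer than min_history weeks  -> "short_history"
    - many zero-sales weeks         -> "sporadic"
    - low coefficient of variation  -> "stable"
    - otherwise                     -> "volatile"
    """
    s = sales.dropna()
    if len(s) < min_history:
        return "short_history"
    if (s <= 0).mean() > zero_share_cut:
        return "sporadic"
    if s.mean() > 0 and s.std() / s.mean() <= cov_cut:
        return "stable"
    return "volatile"

# Toy example: a steady series vs. a mostly-zero one
steady = pd.Series(100 + np.random.default_rng(0).normal(0, 5, 60))
spiky = pd.Series([0.0] * 40 + [500.0] * 20)
print(classify_series(steady))  # stable
print(classify_series(spiky))   # sporadic
```

Stable series suit classical smoothing/ARIMA-style models, sporadic series suit intermittent-demand methods (e.g. Croston-type), and short-history series lean on pooled/hierarchical models.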
You may use the colab notebook to create charts, do EDA etc to prepare your final forecasting data.
Please do appropriate research to design the solution. More than one type of modeling may be needed to solve the case study, depending on the characteristics of the data.
Please highlight relevant research/EDA (if any) to support your solution design.
While the data is provided to support the modeling exercise end to end, it is not expected that you develop the models. That is optional.
The case study will be evaluated on the solution design and suggested approaches.
The output should be the Python code for all the EDA, time series plots, etc., used to arrive at the design of the overall solution and the overall recommendation.
The deepnote notebook can be used to do all the EDA you want to do.
# import pandas as pd
# df = pd.read_csv('Data_Forecasting/Train_data.csv')
# df
Author: Anish Mahapatra
Email: anishmahapatra01@gmail.com
Aim
Create a demand forecasting approach or mix of approaches for forecasting sales at a store-category-week level
To do this, we have been given five relevant datasets
Now, we need to understand the possible approaches we can take:
Join relevant tables (Train_data, Macro_Economic_Data, Holiday_Calendar & Outlets)
Perform data investigation
(Made by Author - Anish via draw.io)

# Installing the required packages
! pip uninstall -y pandas-profiling &> /dev/null # Uninstalling first due to a bug in Google Colab
! pip install pandas-profiling[notebook,html] &> /dev/null # Package for automatic EDA reports
! pip install sweetviz &> /dev/null # Package for some sweet visualizations
! pip install -U dataprep &> /dev/null # Package for instant data preparation
! pip install --upgrade plotly &> /dev/null # Updating plotly to the latest version
! pip install jupyter-dash &> /dev/null # Installing jupyter-dash
! pip install pydotplus &> /dev/null # Visualization library
! pip install eli5 &> /dev/null
! pip install lime &> /dev/null
! pip3 install catboost &> /dev/null
# Importing the required packages
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import GridSearchCV
from sklearn import linear_model
import matplotlib.pyplot as plt
import pandas_profiling  # Automatic EDA reports
import sweetviz as sv  # Automatic EDA visualizations
import seaborn as sns  # Statistical visualization
import pandas as pd  # Data manipulation
import numpy as np  # Numerical computing
import IPython  # Displaying HTML files in the notebook
import datetime
import os
%matplotlib inline
# Hide warnings
import warnings
warnings.filterwarnings('ignore')
# Raising the display limits to 500 columns and 500 rows
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
holiday_calendar = pd.read_csv('https://raw.githubusercontent.com/anishmahapatra/DataScienceInterviewPrep/main/Antuit.ai/FORECASTING-CASE-STUDY-RETAILER-X/Data_Forecasting/holiday_calendar.csv')
Macro_Economic_Data = pd.read_csv('https://raw.githubusercontent.com/anishmahapatra/DataScienceInterviewPrep/main/Antuit.ai/FORECASTING-CASE-STUDY-RETAILER-X/Data_Forecasting/Macro_Economic_Data.csv')
Outlets = pd.read_csv('https://raw.githubusercontent.com/anishmahapatra/DataScienceInterviewPrep/main/Antuit.ai/FORECASTING-CASE-STUDY-RETAILER-X/Data_Forecasting/Outlets.csv')
Train_Data = pd.read_csv('https://raw.githubusercontent.com/anishmahapatra/DataScienceInterviewPrep/main/Antuit.ai/FORECASTING-CASE-STUDY-RETAILER-X/Data_Forecasting/Train_data.csv')
Test_data = pd.read_csv('https://raw.githubusercontent.com/anishmahapatra/DataScienceInterviewPrep/main/Antuit.ai/FORECASTING-CASE-STUDY-RETAILER-X/Data_Forecasting/Test_data.csv')
Train_Data.head()
| | Outlet | Category | YEARWK | Weekly_Sales |
|---|---|---|---|---|
| 0 | 1 | 1 | 201006 | 24930.50 |
| 1 | 1 | 1 | 201007 | 46045.49 |
| 2 | 1 | 1 | 201008 | 41601.55 |
| 3 | 1 | 1 | 201009 | 19409.54 |
| 4 | 1 | 1 | 201010 | 21833.90 |
Macro_Economic_Data.head()
| | Outlet | YEARWK | Temp | Gas_price_in_dollars | Consumer_price_Index | UnEmployment_rate |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 201006.0 | 43.36 | 2.662 | 212.096358 | 7.606 |
| 1 | 1.0 | 201007.0 | 39.56 | 2.638 | 212.242170 | 7.606 |
| 2 | 1.0 | 201008.0 | 40.98 | 2.604 | 212.289143 | 7.606 |
| 3 | 1.0 | 201009.0 | 47.68 | 2.651 | 212.319643 | 7.606 |
| 4 | 1.0 | 201010.0 | 47.55 | 2.715 | 212.350143 | 7.606 |
trainData_macroEconomicData = pd.merge(Train_Data, Macro_Economic_Data, on=['Outlet', 'YEARWK'], how='left')
trainData_macroEconomicData.head()
| | Outlet | Category | YEARWK | Weekly_Sales | Temp | Gas_price_in_dollars | Consumer_price_Index | UnEmployment_rate |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 201006 | 24930.50 | 43.36 | 2.662 | 212.096358 | 7.606 |
| 1 | 1 | 1 | 201007 | 46045.49 | 39.56 | 2.638 | 212.242170 | 7.606 |
| 2 | 1 | 1 | 201008 | 41601.55 | 40.98 | 2.604 | 212.289143 | 7.606 |
| 3 | 1 | 1 | 201009 | 19409.54 | 47.68 | 2.651 | 212.319643 | 7.606 |
| 4 | 1 | 1 | 201010 | 21833.90 | 47.55 | 2.715 | 212.350143 | 7.606 |
trainData_macroEconomicData_holidayData = pd.merge(trainData_macroEconomicData, holiday_calendar, on=['YEARWK'], how='left')
trainData_macroEconomicData_holidayData.head()
| | Outlet | Category | YEARWK | Weekly_Sales | Temp | Gas_price_in_dollars | Consumer_price_Index | UnEmployment_rate | Is_A_Holiday_Flag |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 201006 | 24930.50 | 43.36 | 2.662 | 212.096358 | 7.606 | False |
| 1 | 1 | 1 | 201007 | 46045.49 | 39.56 | 2.638 | 212.242170 | 7.606 | True |
| 2 | 1 | 1 | 201008 | 41601.55 | 40.98 | 2.604 | 212.289143 | 7.606 | False |
| 3 | 1 | 1 | 201009 | 19409.54 | 47.68 | 2.651 | 212.319643 | 7.606 | False |
| 4 | 1 | 1 | 201010 | 21833.90 | 47.55 | 2.715 | 212.350143 | 7.606 | False |
trainData_macroEconomicData_holidayData_outlet = pd.merge(trainData_macroEconomicData_holidayData, Outlets, on=['Outlet'], how='left')
trainData_macroEconomicData_holidayData_outlet.head()
| | Outlet | Category | YEARWK | Weekly_Sales | Temp | Gas_price_in_dollars | Consumer_price_Index | UnEmployment_rate | Is_A_Holiday_Flag | Outlet_Type | Floor_Size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 201006 | 24930.50 | 43.36 | 2.662 | 212.096358 | 7.606 | False | A | 302630 |
| 1 | 1 | 1 | 201007 | 46045.49 | 39.56 | 2.638 | 212.242170 | 7.606 | True | A | 302630 |
| 2 | 1 | 1 | 201008 | 41601.55 | 40.98 | 2.604 | 212.289143 | 7.606 | False | A | 302630 |
| 3 | 1 | 1 | 201009 | 19409.54 | 47.68 | 2.651 | 212.319643 | 7.606 | False | A | 302630 |
| 4 | 1 | 1 | 201010 | 21833.90 | 47.55 | 2.715 | 212.350143 | 7.606 | False | A | 302630 |
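After chaining several left joins, it is worth verifying how many fact rows actually found a match in each dimension table, since silent mismatches show up later as NaNs. A minimal sketch using `pd.merge(..., indicator=True)` on small stand-in frames (the toy `train`/`macro` data below is illustrative, not the real CSVs):

```python
import pandas as pd

# Toy stand-ins for Train_Data and Macro_Economic_Data (illustrative only)
train = pd.DataFrame({"Outlet": [1, 1, 2],
                      "YEARWK": [201006, 201007, 201006],
                      "Weekly_Sales": [100.0, 120.0, 90.0]})
macro = pd.DataFrame({"Outlet": [1, 1],
                      "YEARWK": [201006, 201007],
                      "Temp": [43.4, 39.6]})

# indicator=True adds a _merge column so unmatched rows are easy to count
merged = pd.merge(train, macro, on=["Outlet", "YEARWK"],
                  how="left", indicator=True)
match_rate = (merged["_merge"] == "both").mean()
print(merged["_merge"].value_counts())
print(f"match rate: {match_rate:.0%}")
```

A low match rate on the macro join would explain the missing-value percentages observed below and argues for imputation rather than row deletion.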
df = trainData_macroEconomicData_holidayData_outlet.copy(deep=True)
df['YEARWK'] = df['YEARWK'].astype(str)
df['Year'] = df['YEARWK'].str[:4]
df['Week'] = df['YEARWK'].str[4:]
# Viewing the shape of the data - (rows, columns)
df.shape
(382261, 13)
# Summary of the dataset
print(df.info(verbose=True))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 382261 entries, 0 to 382260
Data columns (total 15 columns):
 #   Column                Non-Null Count   Dtype
---  ------                --------------   -----
 0   df_index              382261 non-null  int64
 1   Outlet                382261 non-null  int64
 2   Category              382261 non-null  int64
 3   YEARWK                382261 non-null  object
 4   Weekly_Sales          382261 non-null  float64
 5   Temp                  337992 non-null  float64
 6   Gas_price_in_dollars  337992 non-null  float64
 7   Consumer_price_Index  337992 non-null  float64
 8   UnEmployment_rate     307424 non-null  float64
 9   Is_A_Holiday_Flag     382261 non-null  bool
 10  Outlet_Type           382261 non-null  object
 11  Floor_Size            382261 non-null  int64
 12  Year                  382261 non-null  object
 13  Week                  382261 non-null  object
 14  week_date             382261 non-null  datetime64[ns]
dtypes: bool(1), datetime64[ns](1), float64(5), int64(4), object(4)
memory usage: 41.2+ MB
None
# Printing all the columns with at least one null value
df.columns[df.isna().any()].tolist()
['Temp', 'Gas_price_in_dollars', 'Consumer_price_Index', 'UnEmployment_rate']
print("The number of columns with at least one NULL value is: ", len(df.columns[df.isna().any()].tolist()))
The number of columns with at least one NULL value is:  4
# Visualizing the percentage of missing values
missing = (df.isnull().sum() / len(df)) * 100
missing_df = pd.DataFrame({'Feature': df.columns,
'missing (%)': round(missing,2)})
missing_df
| | Feature | missing (%) |
|---|---|---|
| Outlet | Outlet | 0.00 |
| Category | Category | 0.00 |
| YEARWK | YEARWK | 0.00 |
| Weekly_Sales | Weekly_Sales | 0.00 |
| Temp | Temp | 11.58 |
| Gas_price_in_dollars | Gas_price_in_dollars | 11.58 |
| Consumer_price_Index | Consumer_price_Index | 11.58 |
| UnEmployment_rate | UnEmployment_rate | 19.58 |
| Is_A_Holiday_Flag | Is_A_Holiday_Flag | 0.00 |
| Outlet_Type | Outlet_Type | 0.00 |
| Floor_Size | Floor_Size | 0.00 |
| Year | Year | 0.00 |
| Week | Week | 0.00 |
df.isnull().sum()
Outlet                      0
Category                    0
YEARWK                      0
Weekly_Sales                0
Temp                    44269
Gas_price_in_dollars    44269
Consumer_price_Index    44269
UnEmployment_rate       74837
Is_A_Holiday_Flag           0
Outlet_Type                 0
Floor_Size                  0
Year                        0
Week                        0
dtype: int64
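The macro columns are missing for 11-20% of rows, so some imputation strategy is needed before they can be used as regressors. One hedged option, since macro series move slowly, is time-ordered interpolation within each outlet, with forward/backward fill at the edges. A minimal sketch on toy data (the helper `impute_macro` is illustrative, not part of the original notebook):

```python
import numpy as np
import pandas as pd

def impute_macro(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Interpolate macro columns within each outlet's time-ordered series,
    then forward/backward fill the edges. One option among several
    (seasonal averages or an external macro source could also work)."""
    out = df.sort_values(["Outlet", "YEARWK"]).copy()
    for col in cols:
        out[col] = (out.groupby("Outlet")[col]
                       .transform(lambda s: s.interpolate().ffill().bfill()))
    return out

toy = pd.DataFrame({
    "Outlet": [1, 1, 1, 2, 2],
    "YEARWK": [201006, 201007, 201008, 201006, 201007],
    "Temp": [40.0, np.nan, 44.0, np.nan, 30.0],
})
filled = impute_macro(toy, ["Temp"])
print(filled["Temp"].tolist())  # [40.0, 42.0, 44.0, 30.0, 30.0]
```

Grouping by `Outlet` matters: interpolating across outlet boundaries would leak one store's local conditions into another's series.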
df = df.reset_index()
df['week_date'] = (df[['Year', 'Week']].astype(str)
                   .apply(lambda x: datetime.datetime.strptime('-W'.join(x) + '-1', "%Y-W%W-%w"), axis=1))
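A caveat on the `%Y-W%W-%w` directives used above: `%W` numbers weeks from the first Monday of the year, with a week 00 before it, which can disagree with retail week numbering. If `YEARWK` follows ISO week numbering (an assumption worth checking against the holiday calendar), the ISO directives `%G/%V/%u` are a safer parse. A hedged alternative sketch:

```python
import datetime
import pandas as pd

def yearwk_to_date(yearwk: pd.Series) -> pd.Series:
    """Parse YYYYWW codes to the Monday of each ISO week using Python's
    %G/%V/%u directives. Assumes YEARWK encodes ISO year/week; a retailer
    with a custom fiscal calendar would need a lookup table instead."""
    s = yearwk.astype(str)
    return s.apply(lambda v: datetime.datetime.strptime(
        f"{v[:4]}-W{v[4:]}-1", "%G-W%V-%u"))

weeks = pd.Series([201006, 201007])
print(yearwk_to_date(weeks).tolist())
```

For these sample weeks both conventions happen to agree (201006 maps to 2010-02-08, matching the table above), but year boundaries are where `%W` and ISO weeks diverge.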
from pandas_profiling import ProfileReport # Pandas Profile to visualize the data
# Generating the profile report and feeding it into a variable
Profile = ProfileReport(df, title = 'Pandas Profiling Report', html = {'style':{'full_width':True}})
# Showcasing the Pandas Profiling report for the merged retail dataset
Profile
# We shall use the SweetViz analyze() function to generate a shareable .html file to analyze the dataset
sweetvizReport = sv.analyze(df)
sweetvizReport.show_html('SweetViz.html', open_browser=False)
Report SweetViz.html was generated.
# Displaying the html file in the Google Colab notebook
IPython.display.HTML('SweetViz.html')
Data Understanding:
df = trainData_macroEconomicData_holidayData_outlet.copy(deep=True)
df['YEARWK'] = df['YEARWK'].astype(str)
df['Year'] = df['YEARWK'].str[:4]
df['Week'] = df['YEARWK'].str[4:]
df = df.reset_index()
df['week_date'] = (df[['Year', 'Week']].astype(str)
                   .apply(lambda x: datetime.datetime.strptime('-W'.join(x) + '-1', "%Y-W%W-%w"), axis=1))
df_mod = df.copy(deep=True)
df.head()
| | index | Outlet | Category | YEARWK | Weekly_Sales | Temp | Gas_price_in_dollars | Consumer_price_Index | UnEmployment_rate | Is_A_Holiday_Flag | Outlet_Type | Floor_Size | Year | Week | week_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 201006 | 24930.50 | 43.36 | 2.662 | 212.096358 | 7.606 | False | A | 302630 | 2010 | 06 | 2010-02-08 |
| 1 | 1 | 1 | 1 | 201007 | 46045.49 | 39.56 | 2.638 | 212.242170 | 7.606 | True | A | 302630 | 2010 | 07 | 2010-02-15 |
| 2 | 2 | 1 | 1 | 201008 | 41601.55 | 40.98 | 2.604 | 212.289143 | 7.606 | False | A | 302630 | 2010 | 08 | 2010-02-22 |
| 3 | 3 | 1 | 1 | 201009 | 19409.54 | 47.68 | 2.651 | 212.319643 | 7.606 | False | A | 302630 | 2010 | 09 | 2010-03-01 |
| 4 | 4 | 1 | 1 | 201010 | 21833.90 | 47.55 | 2.715 | 212.350143 | 7.606 | False | A | 302630 | 2010 | 10 | 2010-03-08 |
df_year = df_mod.set_index(['Year'])
df_week_date = df_mod.set_index(['week_date'])
Analyzing weekly sales aggregated at a year level (mean, sum)
plt.plot(df_year.groupby(df_year.index)["Weekly_Sales"].mean())
plt.plot(df_year.groupby(df_year.index)["Weekly_Sales"].sum())
Analyzing weekly sales aggregated at a week level (mean, sum)
plt.plot(df_week_date.groupby(df_week_date.index)["Weekly_Sales"].mean())
plt.plot(df_week_date.groupby(df_week_date.index)["Weekly_Sales"].sum())
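To turn these aggregate plots into design decisions, it helps to separate trend from annual seasonality explicitly. Below is a minimal additive-decomposition sketch on a synthetic weekly series standing in for total `Weekly_Sales` (the synthetic data and the 53-week window are illustrative assumptions; `statsmodels.tsa.seasonal.STL` would be the more rigorous tool):

```python
import numpy as np
import pandas as pd

# Synthetic weekly series standing in for total Weekly_Sales (illustrative)
rng = np.random.default_rng(42)
weeks = pd.date_range("2010-02-08", periods=104, freq="W-MON")
sales = pd.Series(1000 + 5 * np.arange(104)
                  + 200 * np.sin(2 * np.pi * np.arange(104) / 52)
                  + rng.normal(0, 20, 104), index=weeks)

# Crude additive decomposition: centred moving average as the trend,
# week-of-year means of the detrended series as the seasonal component
trend = sales.rolling(window=53, center=True, min_periods=1).mean()
detrended = sales - trend
seasonal = detrended.groupby(detrended.index.isocalendar().week).transform("mean")
residual = detrended - seasonal
print(f"seasonal amplitude ~ {seasonal.max() - seasonal.min():.0f}")
```

A large seasonal amplitude relative to the residual argues for models with explicit annual seasonality (lag-52 features, Fourier terms, or seasonal exponential smoothing).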
Now that we have done the preliminary Exploratory Data Analysis, let's list some of our observations:
Based on the problem statement of the case study, marks are awarded for the approach rather than the modelling.
The metrics to consider are:
So, below is the methodology we should follow:
(Made by Author - Anish via draw.io)

Here, it is critical to note that the stakeholders expect the forecast at a category-store-week level.
We will compare this with the baseline model that we have and understand if we are getting better results and why.
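Comparing against a baseline requires pinning down both the baseline and the metric. A minimal sketch of a seasonal-naive baseline (forecast each week with the value from one season earlier) evaluated by weighted MAPE; the season length of 52 and the toy series are illustrative assumptions:

```python
import numpy as np
import pandas as pd

def wmape(actual: pd.Series, forecast: pd.Series) -> float:
    """Weighted MAPE: sum of absolute errors over sum of actuals.
    More robust than plain MAPE when some weeks have near-zero sales."""
    return float(np.abs(actual - forecast).sum() / np.abs(actual).sum())

def seasonal_naive(history: pd.Series, horizon: int, season: int = 52) -> pd.Series:
    """Forecast each future week with the value observed one season earlier.
    Assumes a weekly series with at least one full season of history."""
    reps = -(-horizon // season)  # ceiling division
    tail = history.iloc[-season:]
    return pd.concat([tail] * reps, ignore_index=True).iloc[:horizon]

# Toy check: a perfectly seasonal history gives a zero-error forecast
history = pd.Series(list(range(1, 53)) * 2, dtype=float)
fcst = seasonal_naive(history, horizon=6)
actual = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
print(f"WMAPE: {wmape(actual, fcst):.2%}")  # 0.00%
```

Any candidate model (per-segment or pooled) must beat this baseline at the category-store-week level to justify its added complexity.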
Based on this, we will decide the following:
One thing to note here is that we could forecast further ahead, and more accurately, if we had more sales history.
We have commented on sales patterns in our observations and derived additional features, such as lags, in our recommendations.
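The lag features mentioned above can be generated per outlet-category series. A minimal sketch (the helper name and the lag/window choices are illustrative; lag 52 targets annual seasonality):

```python
import pandas as pd

def add_lag_features(df: pd.DataFrame,
                     lags=(1, 4, 52),
                     roll_windows=(4, 13)) -> pd.DataFrame:
    """Add per-series lag and rolling-mean features of Weekly_Sales.
    Grouping by (Outlet, Category) keeps one series from leaking
    into another."""
    out = df.sort_values(["Outlet", "Category", "YEARWK"]).copy()
    grp = out.groupby(["Outlet", "Category"])["Weekly_Sales"]
    for lag in lags:
        out[f"sales_lag_{lag}"] = grp.shift(lag)
    for w in roll_windows:
        # shift(1) first so the window never sees the current week's target
        out[f"sales_rollmean_{w}"] = grp.transform(
            lambda s: s.shift(1).rolling(w, min_periods=1).mean())
    return out

toy = pd.DataFrame({"Outlet": [1] * 5, "Category": [1] * 5,
                    "YEARWK": [201006, 201007, 201008, 201009, 201010],
                    "Weekly_Sales": [10.0, 20.0, 30.0, 40.0, 50.0]})
feat = add_lag_features(toy, lags=(1, 2), roll_windows=(2,))
print(feat[["sales_lag_1", "sales_rollmean_2"]].to_string(index=False))
```

For the short-term (<= 6 weeks) horizon, recent lags and rolling means carry most of the signal; for the long-term horizon, lag 52 and the calendar/macro regressors matter more since recent lags are unavailable at prediction time.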